
Week 8 - Support Vector Machines

Dr. David Elliott

  1. Imbalanced Data

  2. Multi-Class

  3. Strengths and Limitations

Notes

6. Imbalanced Data

Class imbalance is a common problem when working with real-world data.

This occurs when examples from one or more classes are over-represented in a dataset (e.g. spam filtering, fraud detection, disease screening).

Dataset Example: Default

Default: Customer default records for a credit card company

"We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance."$^5$


If we run a random hyperparameter search, we can find a model with high accuracy.

However, this is not much better than a completely useless model that only predicts "No".


This binary classifier can make two types of errors$^5$: it can incorrectly classify individuals who default, or it can incorrectly classify individuals who do not default.

While the overall error rate is low, the error rate among individuals who defaulted is very high.


Error and Accuracy$^1$

These give general performance information: the proportion of false (or correct) predictions relative to the total number of predictions, across both the positive and negative labels.

$$ \begin{align} ERR &= \frac{FP+FN}{FP+FN+TP+TN} \\ \\ ACC &= 1-ERR \end{align} $$

Precision (PRE)$^1$

$$ PRE = \frac{TP}{TP+FP} $$

Recall (or True Positive Rate)$^1$

$$ REC = \frac{TP}{FN+TP} $$

F1-score$^1$

$$ F1 = 2\left(\frac{PRE \times REC}{PRE + REC}\right) $$

We can use a classification report, which gives more information such as the macro avg and weighted avg.

Macro Average: the unweighted mean of a metric across the classes.

Weighted Average: the mean of a metric across the classes, weighted by each class's support (its number of true instances).
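As a small check of the formulas above, here is a sketch on hypothetical toy labels (not the Default data), where the rare class "1" plays the role of defaulters:

```python
# Hypothetical toy labels: class 1 (default) is the rare positive class.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one FP, one TP, one FN

print(accuracy_score(y_true, y_pred))    # 0.8 -- high despite poor minority recall
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 1/2 = 0.5
print(recall_score(y_true, y_pred))      # TP / (FN + TP) = 1/2 = 0.5
print(f1_score(y_true, y_pred))          # harmonic mean of PRE and REC = 0.5
print(classification_report(y_true, y_pred))  # per-class, macro, and weighted rows
```

Note how accuracy stays at 0.8 even though half the minority class is missed.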


Potential Reasons for Poor Performance

Training Error > Test Error$^5$


Optimising for Accuracy

During hyperparameter tuning with cross-validation, we are choosing the model with the best overall accuracy.

This gives us a model with the smallest possible total number of misclassified observations, irrespective of which class the errors come from$^5$.

ML algorithms typically optimize a reward or cost function computed as a sum over the training examples, so the decision rule is likely to be biased toward the majority class$^9$.


Potential Solutions for Poor Performance

There are a number of methods available to address imbalances in a dataset, such as:

  1. Stratifying the k-folds,
  2. Weighting the classes in the model during training,
  3. Changing the training metric,
  4. Resampling the data.

Stratified k-fold

Without stratification, some folds may not contain the same class balance as the full dataset (a fold could hold few or no minority examples), so the validation error we get from those models may be a poor estimate of performance. Stratified k-fold preserves each class's proportion in every fold.
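A minimal sketch, on a hypothetical 80/20 toy dataset, of how Scikit-Learn's StratifiedKFold keeps the class ratio constant across folds:

```python
# Hypothetical 80/20 imbalanced toy data.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)     # 20 samples, one dummy feature
y = np.array([0] * 16 + [1] * 4)     # 16 majority, 4 minority

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold holds a quarter of each class
    print(np.bincount(y[val_idx]))   # -> [4 1] each time
```

With plain (unstratified) KFold on the same data, a fold could easily contain zero minority samples.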

Weights

During model fitting we can assign a larger penalty to wrong predictions on the minority class.

The heuristic used for class_weight="balanced" in Scikit-Learn (0.23.1) is:

$$ \frac{n}{N_c \times \sum\limits^n_{i=1}I(y_i \in S)}, $$

where $n$ is the number of samples, $N_c$ the number of classes, $I$ is an indicator function, and $S$ is the set of samples belonging to the class in question (so the sum counts that class's samples).
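This heuristic is exposed directly through Scikit-Learn's compute_class_weight helper; a sketch on a hypothetical 90/10 split:

```python
# weight_c = n / (n_classes * count_c), the "balanced" heuristic.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)    # hypothetical 90/10 imbalance
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)                       # [100/(2*90), 100/(2*10)] = [0.5555..., 5.0]

# The same weighting can be applied inside an SVM fit, e.g.:
# from sklearn.svm import SVC
# clf = SVC(kernel="linear", class_weight="balanced").fit(X, y)
```

Mistakes on the minority class are thus penalised roughly nine times more heavily here.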

Extra

So far in these notes we have been using a standard classification report from Scikit-Learn; however, we may instead wish to use one more suited to imbalanced data.

TODO


Changing Training/Validation Metric

Changing the metric for what is defined as the "best model" can help us prioritise models that make particular errors.

For example, a credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default, though still to be avoided, is less problematic.

In this case, recall would therefore be a useful metric to use.
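As a sketch (on synthetic data rather than the Default set), the scoring argument of GridSearchCV lets us select models by cross-validated recall instead of accuracy:

```python
# Select C by cross-validated recall on hypothetical imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, weights=[0.9, 0.1],
                           random_state=0)                 # ~90/10 imbalance
search = GridSearchCV(SVC(kernel="linear", class_weight="balanced"),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="recall", cv=5)              # recall, not accuracy
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The "best model" is now the one that misses the fewest positives (defaulters), even if its overall accuracy is lower.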


Extra

This is just some additional visualisations.

Provided you're comfortable using metrics instead of relying on a confusion_matrix, you can use more of your training data by passing multiple metrics to GridSearchCV.

From here on, I will get rid of my separate training and validation sets and just use "recall" as our metric of interest.

Resampling

We can change the distribution of the classes in our training data.

Under-Sampling

A fast way to balance the data is to randomly select a subset of each larger class, so that every class has the number of data points found in the smallest class.
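A minimal NumPy sketch of this idea, using hypothetical data (the RandomUnderSampler class in the imbalanced-learn package does the same job with more options):

```python
# Random under-sampling: keep all minority samples plus an equally
# sized random subset of the majority class.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)              # hypothetical 95/5 split
X = np.arange(100).reshape(-1, 1)

minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0),
                      size=minority.size, replace=False)
keep = np.concatenate([majority, minority])
X_res, y_res = X[keep], y[keep]
print(np.bincount(y_res))                     # -> [5 5]
```

The cost is obvious: 90 of the 95 majority examples are simply thrown away.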


Oversampling

Data can be oversampled easily by randomly sampling from minority classes with replacement to duplicate original samples.
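The mirror-image NumPy sketch on the same hypothetical data (imbalanced-learn's RandomOverSampler is the library equivalent):

```python
# Random over-sampling: duplicate minority samples with replacement
# until both classes are the same size.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)              # hypothetical 95/5 split
X = np.arange(100).reshape(-1, 1)

minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=95 - minority.size, replace=True)
keep = np.concatenate([np.arange(y.size), extra])  # originals + duplicates
X_res, y_res = X[keep], y[keep]
print(np.bincount(y_res))                     # -> [95 95]
```

No information is discarded, but the duplicated points can encourage overfitting to the few distinct minority examples.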


We can see if an RBF kernel improves things although, if you plan on running this yourself (overwrite=True), it is computationally expensive.

Extra

NearMiss

A number of undersampling methods use heuristics based on k-nearest neighbors (KNN) classification$^8$. KNN finds a number of samples that are the most similar to a data point we want to classify, based on a given distance metric, with its assigned class label depending on a majority vote by the nearest neighbours$^9$ (we'll come back to this later). NearMiss uses this by selecting samples in the class to be under-sampled where the average distance to the closest or farthest samples of the minority class is smallest$^{10}$.
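A rough from-scratch sketch of the NearMiss-1 variant on hypothetical Gaussian data (the NearMiss class in imbalanced-learn is the practical implementation): keep the majority samples whose mean distance to their k closest minority neighbours is smallest.

```python
# NearMiss-1 sketch: under-sample the majority class down to the
# minority-class size, preferring majority points near the minority.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(20, 2))    # hypothetical majority class
X_min = rng.normal(2.0, 1.0, size=(5, 2))     # hypothetical minority class

d = pairwise_distances(X_maj, X_min)          # (20, 5) distance matrix
k = 3
mean_closest = np.sort(d, axis=1)[:, :k].mean(axis=1)   # avg dist to k nearest
keep = np.argsort(mean_closest)[:X_min.shape[0]]        # 5 closest majority pts
X_maj_res = X_maj[keep]
print(X_maj_res.shape)                        # -> (5, 2)
```

The retained majority points sit near the class boundary, which is where the SVM's support vectors live anyway.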

NeighbourhoodCleaningRule

Undersampling techniques also include data cleaning rules, where the number of samples in classes are not specified, but data is edited based on methods such as removing data dissimilar to their neighbourhood$^{11}$ or by removing one or both samples in different classes when they are nearest neighbors of each other$^{12}$.

ADASYN and SMOTE

Instead of just randomly oversampling, there are also approaches that generate new samples through interpolation, such as SMOTE and ADASYN. However, these methods can generate noisy samples, so a cleaning rule can be applied after oversampling$^{13}$.
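A from-scratch sketch of the SMOTE interpolation idea on hypothetical minority points (not the library implementation, which uses k = 5 neighbours and handles many details this omits): each synthetic point lies on the segment between a minority sample and one of its minority-class nearest neighbours.

```python
# SMOTE-style interpolation between minority samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, size=(5, 2))       # hypothetical minority points

nn = NearestNeighbors(n_neighbors=2).fit(X_min)  # each point's nearest is itself,
_, idx = nn.kneighbors(X_min)                    # so column 1 is the true neighbour

synthetic = []
for i, j in idx:                               # i: the point, j: its neighbour
    gap = rng.uniform()                        # interpolation factor in [0, 1)
    synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
synthetic = np.array(synthetic)
print(synthetic.shape)                         # -> (5, 2)
```

Because the new points are interpolated rather than duplicated, the classifier sees a filled-in minority region instead of repeated copies of the same five samples.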

Best Approach?

There is no single best approach; it typically depends on the data and the aims of the model.

Below are examples of cross-validation scores for the best models (according to recall) for the different approaches.


Using the figure above, for the client who wants the model to prioritise avoiding incorrectly classifying an individual who will default, we would probably choose the undersampled linear SVM.

As we can see, on the test set we get similar scores to those we saw in validation.


Extra: Improving the model using more features

Would adding in if someone is a student improve the model?


So in this case it does not seem to improve the metric of most interest (recall), although did improve precision at the expense of recall.

Fairness

Imagine we used the model in practice, and those deemed more likely to default were given less favourable terms because the algorithm judged them riskier.

Imagine adding their student status had improved our model. Would it be fair to judge their chances of defaulting based on their student status rather than their actual financial information?

Maybe we need to know more about the people applying for the loan, but what variables do we use?

This is a difficult but important aspect of ML and I leave it here to make you think about the possibilities. We'll be exploring this more in the last week.


7. Multi-Class

Some models (e.g. tree-based classifiers) are inherently multiclass, whereas other machine learning algorithms can be extended to multi-class classification using techniques such as the One-versus-Rest or One-versus-One methods$^3$.

One-versus-all (or One-vs-the-rest)

The One-versus-all approach is where you train a classifier for each class and select the class whose classifier outputs the highest score$^3$.

In other terms, if we fit $K$ SVMs, we assign a test observation, $x^*$, to the class for which $\beta_{0k} + \beta_{1k}x^*_1 + \ldots + \beta_{pk}x^*_p$ is largest (the most confident)$^5$.

Advantage: As each class is fitted against all other classes for each classifier, it is relatively interpretable$^{14}$.

Disadvantages: Can result in ambiguous decision regions (e.g. could be class 1 or class 2), and classifiers could suffer from issues of class imbalance$^4$.
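A sketch on the familiar iris data: Scikit-Learn's OneVsRestClassifier wraps any binary estimator and fits one classifier per class.

```python
# One-vs-Rest: K classes -> K binary SVMs, pick the largest decision value.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))       # 3 classes -> 3 binary classifiers
print(ovr.predict(X[:1]))         # predicted class for the first sample
```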


OneVsOne

Another strategy is to use a OneVsOne approach.

This trains $N \times (N-1) / 2$ classifiers by comparing each class against each other.

When a prediction is made, the class that is selected the most is chosen (Majority Vote)$^3$.

Advantage: It is useful where algorithms do not scale well with data size (such as SVM), because each classifier only needs to be trained on, and predict from, a small subset of the data$^{3,14}$.

Disadvantages: Can still result in ambiguous decision regions and be computationally expensive$^4$.
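The matching sketch with Scikit-Learn's OneVsOneClassifier on the iris data:

```python
# One-vs-One: N * (N - 1) / 2 pairwise SVMs, majority vote at prediction time.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))       # 3 * (3 - 1) / 2 = 3 pairwise classifiers
print(ovo.predict(X[:1]))
```

Worth noting: SVC itself already uses a One-vs-One scheme internally for multi-class problems, so the wrapper here mainly makes the strategy explicit.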


Multiclass scoring metrics$^9$

scikit-learn implements macro and micro averaging methods to extend scoring metrics to multiclass problems.

The micro-average is calculated from the individual TPs, TNs, FPs, and FNs of the system.

For example, the micro-average precision score for a $k$-class system is,

$$ PRE_{micro} = \frac{TP_1+\ldots+TP_K}{TP_1+\ldots+TP_K+FP_1+\ldots+FP_K}. $$

This is useful when we want to weight each instance or prediction equally.

The macro-average is the average of the scores for the individual classes:

$$ PRE_{macro} = \frac{PRE_1+\ldots+PRE_K}{K}. $$

This is useful when we want to weight all classes equally when evaluating the overall performance of a classifier, irrespective of how frequent their labels are.
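A small worked check of the two averages on hypothetical 3-class labels:

```python
# Micro vs macro precision on a toy 3-class problem.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# Per-class precision: PRE_0 = 3/4, PRE_1 = 2/3, PRE_2 = 1/1.
print(precision_score(y_true, y_pred, average="micro"))  # (3+2+1)/(4+3+1) = 0.75
print(precision_score(y_true, y_pred, average="macro"))  # (3/4 + 2/3 + 1)/3 ~= 0.806
```

The micro-average pools all decisions, so the large class 0 dominates it, whereas the macro-average gives the tiny class 2 an equal vote.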

8. Strengths and Limitations

There are always advantages and disadvantages to using any model on a particular dataset.

Advantages

  1. Kernels allow us to construct hyperplanes in high-dimensional spaces with tractable computation$^6$.
    • SVMs allow for complex decision boundaries on both low-dimensional and high-dimensional data (i.e., few and many features).
  2. SVMs always converge on the same answer given identical data and hyperparameters.
  3. Kernelized support vector machines perform well on a variety of datasets$^{16}$.
  4. By softening the margin using a budget (or cost) parameter ($C$), SVMs are relatively robust to outliers$^{17}$.
    • Provided hyperparameters are carefully selected, SVMs are less likely than some ML methods (e.g. trees) to be affected by an over-representation of data in the training phase due to over-parameterization or over-fitting.

Disadvantages

  1. Non-linear SVMs do not scale very well with the number of samples.
    • However, there are batch algorithms which do improve this$^{11}$.
  2. SVMs require careful preprocessing of the data$^{16}$.
    • Tree-based models require little or no preprocessing.
  3. SVM models are hard to inspect$^{16}$.
    • It can be difficult to understand why a particular prediction was made.
    • It might be tricky to explain the model to a nonexpert.
  4. Discrete data typically presents a problem for SVMs.
    • However, there are alternative implementations in the literature to handle discrete data$^{10}$.
  5. They are comparatively sensitive to hyperparameters; there can be quite a variation in score depending on setup.
    • The choice of kernel, for example, can change the results considerably$^{15}$.
  6. SVMs only produce predicted class labels (and their "confidence").
    • Obtaining predicted class probabilities requires additional adjustments and computations not covered here.

References

  1. https://scikit-learn.org/stable/datasets/toy_dataset.html
  2. Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".
  4. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  5. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
  6. http://contrib.scikit-learn.org/imbalanced-learn/stable/introduction.html
  7. https://beckernick.github.io/oversampling-modeling/
  8. Mani, I., & Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets (Vol. 126).
  9. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  10. Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559-563.
  11. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3), 408-421.
  12. Tomek, I. (1976). Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6, 769-772.
  13. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
  14. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
  15. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
  16. Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. " O'Reilly Media, Inc.".
  17. https://bradleyboehmke.github.io/HOML/svm.html